Implement update for remove-snapshots action#1561
Conversation
remove-snapshot actionremove-snapshots action
kevinjqliu
left a comment
There was a problem hiding this comment.
Thanks! This was one of the missing update functions mentioned in #952
Do you mind also including some tests? Similar to
iceberg-python/tests/table/test_init.py
Lines 663 to 682 in 2cd4e78
|
Sure! Thanks for the fast review |
kevinjqliu
left a comment
There was a problem hiding this comment.
left some comments, looks like theres also a linter error, do you mind running make lint locally?
|
|
||
|
|
||
| @_apply_table_update.register(RemoveSnapshotsUpdate) | ||
| def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: |
There was a problem hiding this comment.
pyiceberg/table/update/__init__.py
Outdated
| def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: | ||
| for remove_snapshot_id in update.snapshot_ids: | ||
| if remove_snapshot_id == base_metadata.current_snapshot_id: | ||
| raise ValueError(f"Can't remove current snapshot id {remove_snapshot_id}") |
There was a problem hiding this comment.
should we block the current snapshot?
There was a problem hiding this comment.
I'm not an expert in iceberg spec, but it's not clear what should happen if you try to remove the current snapshot.
I'm also not sure if I should update parent_snapshot_id in every snapshot that was referencing removed snapshots
There was a problem hiding this comment.
Decided to set parent_snapshot_id to None if the parent is gone
There was a problem hiding this comment.
it's not clear what should happen if you try to remove the current snapshot.
im looking at the java implementation for answers, i think you can just remove the current snapshot... because you can have an empty table with no snapshots
There was a problem hiding this comment.
Created a separate pr for remove-snapshot-ref and added a unit test there #1598
|
Hey @kevinjqliu, ready for another review round. I had to cherry pick the changes from #822 to reuse the code that removes refs |
|
Fixed the linters |
|
Removed corresponding statistics files and entries from snapshot log |
kevinjqliu
left a comment
There was a problem hiding this comment.
I added a few comments to the PR. I think we might need some integrations tests to make sure the behavior aligns with the java implementation.
Also, looks like remove_tag and remove_branch are unrelated to this PR, perhaps we can move them to a separate PR.
pyiceberg/table/update/__init__.py
Outdated
| def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: | ||
| for remove_snapshot_id in update.snapshot_ids: | ||
| if remove_snapshot_id == base_metadata.current_snapshot_id: | ||
| raise ValueError(f"Can't remove current snapshot id {remove_snapshot_id}") |
There was a problem hiding this comment.
it's not clear what should happen if you try to remove the current snapshot.
im looking at the java implementation for answers, i think you can just remove the current snapshot... because you can have an empty table with no snapshots
tests/table/test_init.py
Outdated
| assert len(table_v2.metadata.snapshots) == 2 | ||
| assert len(table_v2.metadata.snapshot_log) == 2 | ||
| assert len(table_v2.metadata.refs) == 2 | ||
| update = RemoveSnapshotsUpdate(snapshot_ids=[3051729675574597004]) |
There was a problem hiding this comment.
nit can you make 3051729675574597004 a constant for readability?
There was a problem hiding this comment.
Added constants REMOVE_SNAPSHOT and KEEP_SNAPSHOT
| with pytest.raises(ValueError, match="Can't remove current snapshot id 3055729675574597004"): | ||
| update_table_metadata(table_v2.metadata, (update,)) | ||
|
|
||
|
|
There was a problem hiding this comment.
lets also add some tests for RemoveSnapshotRefUpdate
There was a problem hiding this comment.
Created a separate pr for remove-snapshot-ref and added a unit test there #1598
28c6657 to
19e17f0
Compare
c8c63ea to
32e1e85
Compare
grihabor
left a comment
There was a problem hiding this comment.
Thank you for review! Could you explain which kind of integration tests you want? Like pyspark integration with expire_snapshots call?
pyiceberg/table/update/__init__.py
Outdated
| def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: | ||
| for remove_snapshot_id in update.snapshot_ids: | ||
| if remove_snapshot_id == base_metadata.current_snapshot_id: | ||
| raise ValueError(f"Can't remove current snapshot id {remove_snapshot_id}") |
There was a problem hiding this comment.
Created a separate pr for remove-snapshot-ref and added a unit test there #1598
| with pytest.raises(ValueError, match="Can't remove current snapshot id 3055729675574597004"): | ||
| update_table_metadata(table_v2.metadata, (update,)) | ||
|
|
||
|
|
There was a problem hiding this comment.
Created a separate pr for remove-snapshot-ref and added a unit test there #1598
tests/table/test_init.py
Outdated
| assert len(table_v2.metadata.snapshots) == 2 | ||
| assert len(table_v2.metadata.snapshot_log) == 2 | ||
| assert len(table_v2.metadata.refs) == 2 | ||
| update = RemoveSnapshotsUpdate(snapshot_ids=[3051729675574597004]) |
There was a problem hiding this comment.
Added constants REMOVE_SNAPSHOT and KEEP_SNAPSHOT
pyiceberg/table/update/__init__.py
Outdated
|
|
||
| snapshots = [ | ||
| (s.model_copy(update={"parent_snapshot_id": None}) if s.parent_snapshot_id in update.snapshot_ids else s) | ||
| for s in base_metadata.snapshots |
There was a problem hiding this comment.
Can we make this a little more verbose:
| for s in base_metadata.snapshots | |
| for snapshot in base_metadata.snapshots |
|
@kevinjqliu Sure, merged wth upstream main |
No description provided.